Speech Enhancement Using Speech Synthesis Techniques
Traditional speech enhancement systems reduce noise by modifying the noisy signal to make it more like a clean signal, which suffers from two problems: under-suppression of noise and over-suppression of speech. These problems create distortions in the enhanced speech and hurt the quality of the enhanced signal. We propose to utilize speech synthesis techniques for a higher-quality speech enhancement system. Synthesizing clean speech based on the noisy signal could produce outputs that are both noise-free and high quality. We first show that we can replace the noisy speech with a clean resynthesis drawn from a previously recorded clean-speech dictionary from the same speaker (concatenative resynthesis). Next, we show that, using a speech synthesizer (vocoder), we can create a clean resynthesis of the noisy speech for more than one speaker. We term this parametric resynthesis (PR). PR can generate better prosody from noisy speech than a TTS system that uses textual information only. Additionally, we can use the high-quality speech generation capability of neural vocoders for better-quality speech enhancement. When trained on data from enough speakers, these vocoders can generate speech from unseen speakers, both male and female, with quality similar to that of speakers seen in training. Finally, we show that using neural vocoders we can achieve better objective signal and overall quality than state-of-the-art speech enhancement systems, and better subjective quality than an oracle mask-based system.
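The parametric resynthesis pipeline described above reduces to a two-stage data flow: a prediction model maps the noisy input to clean acoustic features, and a vocoder synthesizes speech from those features. A minimal sketch of that flow (the `predictor` and `vocoder` callables here are toy stand-ins, not the paper's trained models):

```python
def parametric_resynthesis(noisy_features, predictor, vocoder):
    """Parametric resynthesis (PR) sketch: predict clean acoustic
    (vocoder) features from the noisy input, then let a vocoder
    synthesize noise-free speech from them."""
    clean_features = predictor(noisy_features)  # stage 1: denoise in feature space
    return vocoder(clean_features)              # stage 2: resynthesize clean speech

# Toy stand-ins that only illustrate the data flow:
toy_predictor = lambda feats: [f - 0.1 for f in feats]  # pretend feature cleaner
toy_vocoder = lambda feats: sum(feats)                  # pretend waveform generator
output = parametric_resynthesis([0.5, 0.6], toy_predictor, toy_vocoder)
```

In the actual systems, the predictor is a neural network estimating vocoder parameters (e.g. spectral features) and the vocoder is either a classical or neural waveform generator.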
SpeechLMScore: Evaluating speech generation using speech language model
While human evaluation is the most reliable metric for evaluating speech
generation systems, it is generally costly and time-consuming. Previous studies
on automatic speech quality assessment address the problem by predicting human
evaluation scores with machine learning models. However, they rely on
supervised learning and thus suffer from high annotation costs and domain-shift
problems. We propose SpeechLMScore, an unsupervised metric for evaluating
generated speech using a speech language model. SpeechLMScore maps a speech
signal into a sequence of discrete tokens and computes the average
log-probability of generating that sequence under the model. It therefore
requires no human annotation and is a highly scalable
framework. Evaluation results demonstrate that the proposed metric shows a
promising correlation with human evaluation scores on different speech
generation tasks including voice conversion, text-to-speech, and speech
enhancement.
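The scoring rule above reduces to a short computation: tokenize the speech, then average per-token log-probabilities under an autoregressive language model. A minimal sketch with a toy stand-in LM (the real metric uses a language model pretrained on discrete self-supervised speech units; `log_prob_fn` here is an assumption for illustration):

```python
import math

def speechlm_score(tokens, log_prob_fn):
    """Average log-probability of a discrete token sequence under an
    autoregressive LM; higher scores suggest more natural speech."""
    total = 0.0
    for i, tok in enumerate(tokens):
        # log p(token_i | token_0 .. token_{i-1})
        total += log_prob_fn(tokens[:i], tok)
    return total / len(tokens)

# Toy LM: uniform over a 50-unit codebook (a real system would use a
# speech language model trained on discrete speech tokens).
uniform_lm = lambda history, tok: math.log(1.0 / 50)
score = speechlm_score([3, 17, 17, 42], uniform_lm)
```

Because the score needs only the speech signal and a pretrained token LM, no human ratings enter the computation, which is what makes the metric unsupervised and scalable.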
VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
We propose a decoder-only language model, VoxtLM, that can perform
four tasks: speech recognition, speech synthesis, text generation, and speech
continuation. VoxtLM integrates text vocabulary with discrete speech tokens
from self-supervised speech features and uses special tokens to enable
multitask learning. Compared to a single-task model, VoxtLM yields a
significant improvement in speech synthesis, improving speech intelligibility
from 28.9 to 5.6 and objective quality from 2.68 to 3.90.
VoxtLM also improves speech generation and speech recognition performance over
the single-task counterpart. VoxtLM is trained on publicly available data;
training recipes and model checkpoints will be open-sourced to make the work
fully reproducible.
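The shared-vocabulary idea behind VoxtLM can be illustrated concretely: discrete speech units are offset past the text vocabulary, and special tokens mark which modality is conditioned on and which is generated. A sketch (the vocabulary sizes and special-token names below are illustrative assumptions, not VoxtLM's actual values):

```python
TEXT_VOCAB_SIZE = 32_000   # assumed text vocabulary size (illustrative)
SPEECH_UNITS = 500         # assumed number of discrete speech units
SPECIALS = ["<startofspeech>", "<generatetext>", "<generatespeech>"]

def speech_id(unit):
    # map a self-supervised speech unit into the shared id space
    return TEXT_VOCAB_SIZE + unit

def special_id(name):
    # special tokens sit after both text and speech token ranges
    return TEXT_VOCAB_SIZE + SPEECH_UNITS + SPECIALS.index(name)

def asr_sequence(speech_units, text_ids):
    """Build one multitask training sequence for speech recognition:
    condition on speech tokens, then generate text tokens."""
    return ([special_id("<startofspeech>")]
            + [speech_id(u) for u in speech_units]
            + [special_id("<generatetext>")]
            + text_ids)

seq = asr_sequence([1, 2], [5])
```

Swapping the special tokens (e.g. generating speech tokens after text input) yields the other three tasks from the same decoder-only model.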
Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study
Several high-resource Text-to-Speech (TTS) systems currently produce natural,
human-like speech. In contrast, low-resource languages, including Arabic,
have very limited TTS systems due to the lack of resources.
We propose a fully unsupervised method for building TTS, including automatic
data selection and pre-training/fine-tuning strategies for TTS training, using
broadcast news as a case study. We show how careful selection of a smaller
amount of data can help a TTS system generate more natural speech than a
system trained on a bigger dataset. We propose different approaches for: 1)
the data: we applied automatic annotation using DNSMOS, automatic
vowelization, and automatic speech recognition (ASR) to fix transcription
errors; 2) the model: we used transfer learning from a high-resource language
in the TTS model, fine-tuned it with one hour of broadcast recordings, and
then used this model to guide a FastSpeech2-based Conformer model for
duration. Our objective evaluation shows a 3.9% character error rate (CER),
while the ground truth has 1.3% CER. In the subjective evaluation, where 1 is
bad and 5 is excellent, our FastSpeech2-based Conformer model achieved a mean
opinion score (MOS) of 4.4 for intelligibility and 4.2 for naturalness; many
annotators recognized the voice of the broadcaster, demonstrating the
effectiveness of our proposed unsupervised method.
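The automatic data-selection step described above can be sketched as a simple quality filter: keep only broadcast segments whose DNSMOS estimate clears a threshold. A minimal sketch (the threshold value and record fields are illustrative assumptions, not the paper's exact configuration):

```python
def select_utterances(utterances, mos_threshold=3.5):
    """Unsupervised data selection sketch: keep segments whose DNSMOS
    quality estimate passes an (illustrative) threshold."""
    return [u for u in utterances if u["dnsmos"] >= mos_threshold]

# Hypothetical broadcast segments with DNSMOS quality estimates:
corpus = [
    {"id": "seg1", "dnsmos": 4.1},
    {"id": "seg2", "dnsmos": 2.7},  # noisy segment, filtered out
    {"id": "seg3", "dnsmos": 3.8},
]
selected = select_utterances(corpus)
```

Because DNSMOS is itself a model-based estimate, the whole selection loop needs no human quality labels, which is what keeps the pipeline unsupervised.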
EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers
In this paper, we present a novel framework that jointly performs speaker
diarization, speech separation, and speaker counting. Our proposed method
combines end-to-end speaker diarization and speech separation methods, namely,
End-to-End Neural Speaker Diarization with Encoder-Decoder-based Attractor
calculation (EEND-EDA) and the Convolutional Time-domain Audio Separation
Network (ConvTasNet) as a multitask joint model. We also propose a multiple
1x1 convolutional layer architecture for estimating the separation masks
corresponding to the number of speakers, and a post-processing technique for
refining the separated speech signal with speech activity. Experiments using
the LibriMix dataset show that our proposed method outperforms the baselines in
terms of diarization and separation performance for both fixed and flexible
numbers of speakers, as well as speaker counting performance for flexible
numbers of speakers. All materials will be open-sourced and made reproducible
in the ESPnet toolkit.
Comment: submitted to INTERSPEECH 202
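The post-processing idea above, refining separated speech with diarized speech activity, can be sketched as masking out the time frames where diarization estimates a speaker to be silent. This is a simplified stand-in for the paper's technique; the array shapes and the 0.5 activity threshold are assumptions:

```python
import numpy as np

def refine_with_activity(separated, activity, threshold=0.5):
    """Zero out time frames of one speaker's separated signal where
    diarization estimates that speaker to be inactive."""
    frame_mask = (activity > threshold).astype(separated.dtype)
    return separated * frame_mask[None, :]  # broadcast over the frequency axis

sep = np.ones((2, 4))                 # (freq, time) magnitudes for one speaker
act = np.array([0.9, 0.8, 0.1, 0.7])  # per-frame speech activity estimates
refined = refine_with_activity(sep, act)
```

Coupling the two outputs this way lets the diarization branch suppress residual leakage that the separation masks alone leave behind.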
Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Pre-training speech models on large volumes of data has achieved remarkable
success. OpenAI Whisper is a multilingual multitask model trained on 680k hours
of supervised speech data. It generalizes well to various speech recognition
and translation benchmarks even in a zero-shot setup. However, the full
pipeline for developing such models (from data collection to training) is not
publicly accessible, which makes it difficult for researchers to further
improve its performance and address training-related issues such as efficiency,
robustness, fairness, and bias. This work presents an Open Whisper-style Speech
Model (OWSM), which reproduces Whisper-style training using an open-source
toolkit and publicly available data. OWSM even supports more translation
directions and can be more efficient to train. We will publicly release all
scripts used for data preparation, training, inference, and scoring as well as
pre-trained models and training logs to promote open science.
Comment: Accepted at ASRU 202